Goal: classify patients into their respective labels (Parkinson's vs. healthy) using attributes extracted from their voice recordings.
Exploratory data analysis and model building to predict which patients are affected by Parkinson's disease. Accurately diagnosing PD from voice recordings would make this an effective screening step prior to an appointment with a clinician.
# Import Basic packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
import plotly.express as px
%matplotlib inline
# Model Building - LR, KNN, NB,SVC, Decision Tree , Ensemble Models
##import scikit learn for Model Building
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, BaggingClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.ensemble import StackingClassifier
from sklearn import metrics
#Data Preprocessing to scale the data
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.pipeline import Pipeline
# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
# Visualize Tree
from sklearn.tree import export_graphviz
from IPython.display import Image
from os import system
# Display settings
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000
random_state = 42
np.random.seed(random_state)
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data')
data.head()
#Reshaping the Target column = Status
#Drop and re-locate status at the end column in DataFrame
# Create copy of Dataframe for Data manipulation
pdata = data.copy()   # .copy() so manipulations below do not modify the original DataFrame
target= pdata['status']
pdata.drop(['status'], axis = 1,inplace = True)
pdata['status'] = target
pdata.head()
#Optionally open and read the raw file (only works if the dataset has been downloaded locally under this name)
with open("Data - Parkinsons", "r") as pd_data:
    print(pd_data.read())
#To check the dimension or shape of the dataset
pdata.shape
This voice-recording dataset contains 195 observations and 24 attributes.
#status - Health status of the subject (one) - Parkinson's, (zero) - healthy
pdata.groupby('status').count()
Total count of health status (1) - Parkinson's: 147
Total count of health status (0) - healthy: 48
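Given this imbalance, a useful sanity check is the majority-class baseline: a model that always predicts "Parkinson's" is already right about 75% of the time, so any classifier below that accuracy adds nothing. A minimal computation:

```python
# Majority-class baseline accuracy for the 147:48 class split
n_parkinsons, n_healthy = 147, 48
baseline = n_parkinsons / (n_parkinsons + n_healthy)
print(round(baseline, 3))  # → 0.754
```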
# To view the data type and number of values entered in each of the Independent attributes and Dependent attribute
pdata.info()
All the columns/attributes have 195 non-null values.
# display Number of Null values in each of the attribute
pdata.isnull().sum()
No null values are present in any of the attributes.
# check whether the column has any value other than numeric values
pdata.iloc[:,1:][~pdata.iloc[:,1:].applymap(np.isreal).all(1)]
All columns are numeric attributes except the name column.
#describe() show the summary of statistics about all numeric attributes.
pdata.describe().transpose()
MDVP:Fo(Hz) - average vocal fundamental frequency: mean value 154.23 Hz
MDVP:Fhi(Hz) - maximum vocal fundamental frequency: maximum value 592.03 Hz
MDVP:Flo(Hz) - minimum vocal fundamental frequency: minimum value 65.48 Hz
PPE (a nonlinear measure of fundamental frequency variation): 75% of the data points lie at or below about 0.2530.
status: the mean of roughly 0.75 reflects that about three quarters of the subjects are affected by Parkinson's disease.
#Check Correlation of all Attributes
pdata.corr()
pdata.kurtosis(numeric_only = True)
Kurtosis with positive values indicates heavier tails, i.e. those attributes have more extreme values than a normal distribution.
pdata.skew(numeric_only = True)
Positive skewness indicates the data is skewed to the right (long right tail); negative skewness indicates the data is skewed to the left.
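As a quick illustration of these interpretations (a sketch on synthetic data, not the Parkinson's dataset): a right-skewed lognormal sample should show positive skew and positive excess kurtosis, while a normal sample should show values near zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Right-skewed sample (lognormal) vs. a symmetric one (normal)
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=5000))
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5000))

print("lognormal skew :", right_skewed.skew().round(2))      # positive -> long right tail
print("normal    skew :", symmetric.skew().round(2))         # close to 0
print("lognormal kurt :", right_skewed.kurtosis().round(2))  # positive -> heavy tails
```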
print("The average vocal fundamental frequency of person is {:.2f} and \n 90% of the people have a Fo of around {:.2f}".format(pdata['MDVP:Fo(Hz)'].mean(),pdata['MDVP:Fo(Hz)'].quantile(0.90)))
pdata['MDVP:Fo(Hz)'].plot(kind='box');
No outliers present for MDVP:Fo(Hz)
print('Skewness :',pdata['MDVP:Fo(Hz)'].skew())
print('Kurtosis :',pdata['MDVP:Fo(Hz)'].kurtosis())
sns.distplot(pdata['MDVP:Fo(Hz)'],kde = True,rug = True);
The skewness value is positive, so the data is skewed to the right.
The kurtosis value is negative, so the tails are light (fewer extreme values than a normal distribution).
print("The maximum vocal fundamental frequency of a person is {:.2f} and \n 90% of the people have a Fhi of {:.2f}".format(pdata['MDVP:Fhi(Hz)'].mean(),pdata['MDVP:Fhi(Hz)'].quantile(0.90)))
print(pdata['MDVP:Fhi(Hz)'].head(10))
pdata['MDVP:Fhi(Hz)'].plot(kind='box');
Several outliers are present for MDVP:Fhi(Hz)
print('Skewness :',pdata['MDVP:Fhi(Hz)'].skew())
print('Kurtosis :',pdata['MDVP:Fhi(Hz)'].kurtosis())
sns.distplot(pdata['MDVP:Fhi(Hz)'],kde = True,rug = True);
The skewness value is positive, so the data is skewed to the right.
The kurtosis value is positive, so the tails are heavy (more extreme values).
#Outlier Treatment (note: the fences here use Q3 + 1*IQR and Q1 - 1*IQR; the conventional Tukey fences use 1.5*IQR)
q3 = pdata['MDVP:Fhi(Hz)'].quantile(0.75)
q1 = pdata['MDVP:Fhi(Hz)'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier :",pdata['MDVP:Fhi(Hz)'].loc[pdata['MDVP:Fhi(Hz)']>out_above].count())
print("Total observations below outlier : ",pdata['MDVP:Fhi(Hz)'].loc[pdata['MDVP:Fhi(Hz)']<out_below].count())
print("Data points above outlier :\n",pdata['MDVP:Fhi(Hz)'].loc[pdata['MDVP:Fhi(Hz)']>out_above])
mean_val = pdata['MDVP:Fhi(Hz)'].loc[pdata['MDVP:Fhi(Hz)']<=out_above].mean()
pdata['MDVP:Fhi(Hz)'] = pdata['MDVP:Fhi(Hz)'].mask(pdata['MDVP:Fhi(Hz)']>out_above,mean_val)
print("After Outlier Treatment")
print(pdata['MDVP:Fhi(Hz)'].head(10))
pdata['MDVP:Fhi(Hz)'].plot(kind='box');
print('Skewness :',pdata['MDVP:Fhi(Hz)'].skew())
print('Kurtosis :',pdata['MDVP:Fhi(Hz)'].kurtosis())
sns.distplot(pdata['MDVP:Fhi(Hz)'],kde = True,rug = True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
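The same IQR-based capping is repeated for several columns below. A small helper (a sketch; the function name and signature are my own, and it keeps the notebook's 1×IQR fence by default rather than the conventional 1.5×IQR) could factor out the duplication:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.0, replace_with: str = "max") -> pd.Series:
    """Cap values above Q3 + k*IQR with either the largest inlier ('max') or the inlier mean ('mean')."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    inliers = s[s <= upper]
    fill = inliers.max() if replace_with == "max" else inliers.mean()
    return s.mask(s > upper, fill)

# Toy example: the extreme value 100 is replaced by the largest inlier, 4
s = pd.Series([1, 2, 3, 4, 100])
print(cap_outliers_iqr(s).tolist())  # [1, 2, 3, 4, 4]
```

The same call would then handle each column, e.g. `pdata['MDVP:Fhi(Hz)'] = cap_outliers_iqr(pdata['MDVP:Fhi(Hz)'], replace_with="mean")`.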
print("The minimum vocal fundamental frequency of a person is {:.2f} and \n 90% of the people have a Flo of {:.2f}".format(pdata['MDVP:Flo(Hz)'].mean(),pdata['MDVP:Flo(Hz)'].quantile(0.90)))
print(pdata['MDVP:Flo(Hz)'].head(10))
pdata['MDVP:Flo(Hz)'].plot(kind='box');
Several outliers are present for MDVP:Flo(Hz)
print('Skewness : ',pdata['MDVP:Flo(Hz)'].skew())
print('Kurtosis : ',pdata['MDVP:Flo(Hz)'].kurtosis())
sns.distplot(pdata['MDVP:Flo(Hz)'],kde = True,rug = True);
The skewness value is positive, so the data is skewed to the right.
The kurtosis value is positive, so the tails are heavy (more extreme values).
#Outlier Treatment
q3 = pdata['MDVP:Flo(Hz)'].quantile(0.75)
q1 = pdata['MDVP:Flo(Hz)'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier :",pdata['MDVP:Flo(Hz)'].loc[pdata['MDVP:Flo(Hz)']>out_above].count())
print("Total observations below outlier : ",pdata['MDVP:Flo(Hz)'].loc[pdata['MDVP:Flo(Hz)']<out_below].count())
print("Data points above outlier :\n",pdata['MDVP:Flo(Hz)'].loc[pdata['MDVP:Flo(Hz)']>out_above])
max_val = pdata['MDVP:Flo(Hz)'].loc[pdata['MDVP:Flo(Hz)']<=out_above].max()
pdata['MDVP:Flo(Hz)'] = pdata['MDVP:Flo(Hz)'].mask(pdata['MDVP:Flo(Hz)']>out_above,max_val)
print("After Outlier treatment")
print(pdata['MDVP:Flo(Hz)'].head(10))
pdata['MDVP:Flo(Hz)'].plot(kind='box');
print('Skewness : ',pdata['MDVP:Flo(Hz)'].skew())
print('Kurtosis : ',pdata['MDVP:Flo(Hz)'].kurtosis())
sns.distplot(pdata['MDVP:Flo(Hz)'], kde= True, rug = True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
MDVP:Jitter(%)
print(pdata['MDVP:Jitter(%)'].head(10))
pdata['MDVP:Jitter(%)'].plot(kind='box');
Several outliers are present for MDVP:Jitter(%)
print("The minimum vocal fundamental frequency of a person is {:.2f} and \n 90% of the people have a Jitter of {:.2f}".format(pdata['MDVP:Jitter(%)'].mean(),pdata['MDVP:Jitter(%)'].quantile(0.90)))
print('Skewness :',pdata['MDVP:Jitter(%)'].skew())
print('Kurtosis :',pdata['MDVP:Jitter(%)'].kurtosis())
sns.distplot(pdata['MDVP:Jitter(%)'],kde= True , rug= True);
The skewness value is positive, so the data is skewed to the right.
The kurtosis value is positive, so the tails are heavy (more extreme values).
#Outlier Treatment
q3 = pdata['MDVP:Jitter(%)'].quantile(0.75)
q1 = pdata['MDVP:Jitter(%)'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier",pdata['MDVP:Jitter(%)'].loc[pdata['MDVP:Jitter(%)']>out_above].count())
print("Total observations below outlier",pdata['MDVP:Jitter(%)'].loc[pdata['MDVP:Jitter(%)']<out_below].count())
print("Data points above Outlier limit")
print(pdata['MDVP:Jitter(%)'].loc[pdata['MDVP:Jitter(%)']>out_above])
max_val = pdata['MDVP:Jitter(%)'].loc[pdata['MDVP:Jitter(%)']<=out_above].max()
pdata['MDVP:Jitter(%)'] = pdata['MDVP:Jitter(%)'].mask(pdata['MDVP:Jitter(%)']>out_above,max_val)
print("After outlier Treatment")
print(pdata['MDVP:Jitter(%)'].head(10))
pdata['MDVP:Jitter(%)'].plot(kind='box');
print('Skewness : ',pdata['MDVP:Jitter(%)'].skew())
print('Kurtosis : ',pdata['MDVP:Jitter(%)'].kurtosis())
sns.distplot(pdata['MDVP:Jitter(%)'],kde= True , rug = True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
print(pdata['MDVP:Jitter(Abs)'].head(10))
pdata['MDVP:Jitter(Abs)'].plot(kind='box');
print('Skewness : ',pdata['MDVP:Jitter(Abs)'].skew())
print('kurtosis : ',pdata['MDVP:Jitter(Abs)'].kurtosis())
sns.distplot(pdata['MDVP:Jitter(Abs)'],kde = True, rug =True);
#Outlier Treatment
q3 = pdata['MDVP:Jitter(Abs)'].quantile(0.75)
q1 = pdata['MDVP:Jitter(Abs)'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier",pdata['MDVP:Jitter(Abs)'].loc[pdata['MDVP:Jitter(Abs)']>out_above].count())
print("Total observations below outlier",pdata['MDVP:Jitter(Abs)'].loc[pdata['MDVP:Jitter(Abs)']<out_below].count())
print("Data points above Outlier limit")
print(pdata['MDVP:Jitter(Abs)'].loc[pdata['MDVP:Jitter(Abs)']>out_above])
mean_val = pdata['MDVP:Jitter(Abs)'].loc[pdata['MDVP:Jitter(Abs)']<=out_above].mean()
pdata['MDVP:Jitter(Abs)'] = pdata['MDVP:Jitter(Abs)'].mask(pdata['MDVP:Jitter(Abs)']>out_above,mean_val)
print("After Outlier Treatment")
print(pdata['MDVP:Jitter(Abs)'].head(10))
pdata['MDVP:Jitter(Abs)'].plot(kind='box');
print('skewness : ',pdata['MDVP:Jitter(Abs)'].skew())
print('Kurtosis : ',pdata['MDVP:Jitter(Abs)'].kurtosis())
sns.distplot(pdata['MDVP:Jitter(Abs)'],kde = True ,rug= True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
pdata['MDVP:RAP'].plot(kind='box');
print('Skewness : ',pdata['MDVP:RAP'].skew())
print('Kurtosis : ',pdata['MDVP:RAP'].kurtosis())
sns.distplot(pdata['MDVP:RAP'],kde=True , rug =True);
The skewness value is positive hence the data is skewed towards right side
The kurtosis value is positive hence more data points are around the tail
#Outlier Treatment
q3 = pdata['MDVP:RAP'].quantile(0.75)
q1 = pdata['MDVP:RAP'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier",pdata['MDVP:RAP'].loc[pdata['MDVP:RAP']>out_above].count())
print("Total observations below outlier",pdata['MDVP:RAP'].loc[pdata['MDVP:RAP']<out_below].count())
print("Data points above Outlier limit")
print(pdata['MDVP:RAP'].loc[pdata['MDVP:RAP']>out_above])
max_val = pdata['MDVP:RAP'].loc[pdata['MDVP:RAP']<=out_above].max()
pdata['MDVP:RAP'] = pdata['MDVP:RAP'].mask(pdata['MDVP:RAP']>out_above,max_val)
print("After Outlier Treatment")
print(pdata['MDVP:RAP'].head(10))
pdata['MDVP:RAP'].plot(kind='box');
print('Skewness : ',pdata['MDVP:RAP'].skew())
print('Kurtosis : ',pdata['MDVP:RAP'].kurtosis())
sns.distplot(pdata['MDVP:RAP'],kde=True , rug =True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
pdata['MDVP:PPQ'].plot(kind='box');
Several outliers are present for MDVP:PPQ
print('Skewness : ',pdata['MDVP:PPQ'].skew())
print('Kurtosis : ',pdata['MDVP:PPQ'].kurtosis())
sns.distplot(pdata['MDVP:PPQ'],kde=True , rug =True);
The skewness value is positive, so the data is skewed to the right.
The kurtosis value is positive, so the tails are heavy (more extreme values).
#Outlier Treatment
q3 = pdata['MDVP:PPQ'].quantile(0.75)
q1 = pdata['MDVP:PPQ'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier",pdata['MDVP:PPQ'].loc[pdata['MDVP:PPQ']>out_above].count())
print("Total observations below outlier",pdata['MDVP:PPQ'].loc[pdata['MDVP:PPQ']<out_below].count())
print("Data points above Outlier limit")
print(pdata['MDVP:PPQ'].loc[pdata['MDVP:PPQ']>out_above])
max_val = pdata['MDVP:PPQ'].loc[pdata['MDVP:PPQ']<=out_above].max()
pdata['MDVP:PPQ'] = pdata['MDVP:PPQ'].mask(pdata['MDVP:PPQ']>out_above,max_val)
print("After Outlier Treatment")
print(pdata['MDVP:PPQ'].head(10))
pdata['MDVP:PPQ'].plot(kind='box');
print('Skewness : ',pdata['MDVP:PPQ'].skew())
print('Kurtosis : ',pdata['MDVP:PPQ'].kurtosis())
sns.distplot(pdata['MDVP:PPQ'],kde=True , rug =True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
pdata['Jitter:DDP'].plot(kind='box');
print('Skewness : ',pdata['Jitter:DDP'].skew())
print('Kurtosis : ',pdata['Jitter:DDP'].kurtosis())
sns.distplot(pdata['Jitter:DDP'],kde=True , rug =True);
The skewness value is positive, so the data is skewed to the right.
The kurtosis value is positive, so the tails are heavy (more extreme values).
#Outlier Treatment
q3 = pdata['Jitter:DDP'].quantile(0.75)
q1 = pdata['Jitter:DDP'].quantile(0.25)
iqr = q3-q1
out_above = q3+iqr
out_below = q1-iqr
print("outliers_above : {}".format(out_above))
print("outliers_below : {}".format(out_below))
print("Total observations above outlier",pdata['Jitter:DDP'].loc[pdata['Jitter:DDP']>out_above].count())
print("Total observations below outlier",pdata['Jitter:DDP'].loc[pdata['Jitter:DDP']<out_below].count())
print("Data points above Outlier limit")
print(pdata['Jitter:DDP'].loc[pdata['Jitter:DDP']>out_above])
max_val = pdata['Jitter:DDP'].loc[pdata['Jitter:DDP']<=out_above].max()
pdata['Jitter:DDP'] = pdata['Jitter:DDP'].mask(pdata['Jitter:DDP']>out_above,max_val)
print("After Outlier Treatment")
print(pdata['Jitter:DDP'].head(10))
pdata['Jitter:DDP'].plot(kind='box');
print('Skewness : ',pdata['Jitter:DDP'].skew())
print('Kurtosis : ',pdata['Jitter:DDP'].kurtosis())
sns.distplot(pdata['Jitter:DDP'],kde=True , rug =True);
After the outlier treatment, the reduced kurtosis indicates fewer extreme values in the tails.
#Analysis of Shimmer
affected_MDVP = pdata[pdata['status']==1]['MDVP:Shimmer(dB)'].values
not_affected_MDVP = pdata[pdata['status']==0]['MDVP:Shimmer(dB)'].values
sns.distplot(affected_MDVP);
plt.title('Shimmer values for affected cases')
plt.xlabel('Shimmer (dB), affected cases')
plt.show()
sns.boxplot(affected_MDVP);
plt.title('Shimmer values for affected cases')
plt.xlabel('Shimmer (dB), affected cases')
plt.show()
sns.distplot(not_affected_MDVP);
plt.title('Shimmer values for not affected cases')
plt.xlabel('Shimmer (dB), not affected cases')
plt.show()
sns.boxplot(not_affected_MDVP);
plt.title('Shimmer values for not affected cases')
plt.xlabel('Shimmer (dB), not affected cases')
plt.show()
sns.FacetGrid(pdata, hue="status", height=5).map(sns.distplot, "MDVP:Shimmer(dB)").add_legend();  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()
pdata['spread1'].plot(kind='box');
print('Skewness : ',pdata['spread1'].skew())
print('Kurtosis : ',pdata['spread1'].kurtosis())
sns.distplot(pdata['spread1'],kde=True , rug =True);
pdata['spread2'].plot(kind='box');
print('Skewness : ',pdata['spread2'].skew())
print('Kurtosis : ',pdata['spread2'].kurtosis())
sns.distplot(pdata['spread2'],kde=True , rug =True);
pdata['PPE'].plot(kind='box');
print('Skewness : ',pdata['PPE'].skew())
print('Kurtosis : ',pdata['PPE'].kurtosis())
sns.distplot(pdata['PPE'],kde=True , rug =True);
#status - Health status of the subject (one) - Parkinson's, (zero) - healthy
pd.crosstab(pdata['status'],columns='count')
#Target Column Distribution
sns.countplot(pdata['status']);
The target column distribution confirms the class imbalance: far more patients are affected by Parkinson's disease (status = 1) than are healthy (status = 0).
#Bivariate analysis to determine the relationship between each independent attribute and the target column
for i in pdata:
    if i != 'status' and i != 'name':
        sns.catplot(x="status", y=i, kind='box', data=pdata);
The box plots suggest that patients with lower 'HNR', 'MDVP:Flo(Hz)', 'MDVP:Fhi(Hz)' and 'MDVP:Fo(Hz)' values tend to be affected by Parkinson's disease.
#Bivariate Distribution of Target column (Status) with respect to all other Independent Numeric attributes
#Using Scatter Plot
plt.figure(figsize=(10,20))
plt.subplot(6,1,1)
sns.scatterplot(pdata['MDVP:Fo(Hz)'],pdata['MDVP:Fhi(Hz)'], hue = pdata['status'], palette= ['red','blue']);
plt.subplot(6,1,2)
sns.scatterplot(pdata['MDVP:Fo(Hz)'],pdata['MDVP:Flo(Hz)'] , hue = pdata['status'], palette= ['blue','green']);
plt.subplot(6,1,3)
sns.scatterplot(pdata['MDVP:Jitter(%)'], pdata['MDVP:Jitter(Abs)'], hue =pdata['status'], palette= ['green','yellow']);
plt.subplot(6,1,4)
sns.scatterplot(pdata['MDVP:RAP'],pdata['MDVP:PPQ'], hue = pdata['status'], palette= ['green','red']);
plt.subplot(6,1,5)
sns.scatterplot(pdata['NHR'],pdata['HNR'], hue = pdata['status'], palette= ['magenta','yellow']);
plt.subplot(6,1,6)
sns.scatterplot(pdata['RPDE'],pdata['D2'], hue = pdata['status'], palette= ['cyan','blue']);
plt.figure(figsize=(10,20))
plt.subplot(2,1,1)
sns.scatterplot(pdata['spread1'],pdata['PPE'], hue = pdata['status'], palette= ['red','blue']);
plt.subplot(2,1,2)
sns.scatterplot(pdata['spread2'],pdata['PPE'] , hue = pdata['status'], palette= ['blue','green']);
Looking at the relationships between the nonlinear measures of fundamental frequency (PPE vs. spread1 and PPE vs. spread2), the affected and healthy groups separate noticeably.
This suggests PPE is one of the most informative attributes for predicting the target class (status).
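One quick way to sanity-check a claim like "PPE relates strongly to status" is to rank features by their absolute correlation with the target. A sketch (run here on a synthetic stand-in DataFrame, since in the notebook it would simply be `pdata.drop(columns=['name','status']).corrwith(pdata['status'])`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
status = rng.integers(0, 2, n)
# Synthetic stand-ins: 'PPE' is built to track status, 'noise' is unrelated
df = pd.DataFrame({
    "PPE": status * 0.15 + rng.normal(0, 0.02, n),
    "noise": rng.normal(0, 1, n),
    "status": status,
})
corr_with_target = (
    df.drop(columns="status")
      .corrwith(df["status"])
      .abs()
      .sort_values(ascending=False)
)
print(corr_with_target)  # 'PPE' ranks first by construction
```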
#Pair plot showing the bivariate distributions as scatter plots and the univariate distributions on the diagonal (KDE).
sns.pairplot(pdata, hue = "status",diag_kind="kde");
#Use correlation method to observe the relationship between different attributes.
#Apply HeatMap to check the relationship between different attributes.
plt.figure(figsize=(10,8))
sns.heatmap(pdata.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu");
plt.show()
pdata.kurtosis(numeric_only = True)
pdata.skew(numeric_only = True)
from sklearn.model_selection import train_test_split
# Setting Independent features
X = pdata.drop(['status','name'], axis = 1)
#Set Target class label
y = pdata['status']
# Splitting the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = random_state)
X_train.head()
#Display Target column's train data
pd.crosstab(y_train,columns='count',colnames=['Train data'])
# Shape and size of Training dataset
print("Training data size\n",X_train.shape,y_train.shape)
#Display the Independent features test dataset
X_test.head()
#Display Target column's Test data
pd.crosstab(y_test,columns='count',colnames=['Test data'])
#Print Test data size
print("\nTesting data size\n",X_test.shape,y_test.shape)
#check split of dataset
print("{0:0.2f}% data is in training set".format((len(X_train)/len(pdata.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(pdata.index)) * 100))
#Detailed Summary count of Original, Train and Test DataSet
print(" Parkinsons Disease Affected Person count: {0} ({1:0.2f}%)".format(len(pdata.loc[pdata['status'] == 1]), (len(pdata.loc[pdata['status'] == 1])/len(pdata.index)) * 100))
print("Parkinsons Disease not affected Person Count : {0} ({1:0.2f}%)".format(len(pdata.loc[pdata['status'] == 0]), (len(pdata.loc[pdata['status'] == 0])/len(pdata.index)) * 100))
print("")
print("Training data- Parkinsons Disease affected Person count : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training data -Parkinsons Disease not affected Person count : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Testing data- Parkinsons Disease affected Person count: {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Testing data- Parkinsons Disease not affected Person count : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
# Applying RobustScaler, which scales features using the IQR and is therefore less sensitive to outliers
from sklearn.preprocessing import RobustScaler
features = X.columns
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns = features)
# For comparison only: a z-score version of the scaled features (not used further below)
Xscale = X.apply(zscore)
display(X.shape, Xscale.shape, y.shape)
#Apply Standard scaler method to Standardize features by removing the mean and scaling to unit variance
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
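Fitting the scaler on `X_train` only and then transforming `X_test`, as above, is the right pattern. `Pipeline` (imported earlier but not used) packages the same idea so that during cross-validation the scaler is refit on each training fold automatically. A sketch on synthetic data of similar shape:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same sample/feature counts as this dataset
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                     # refit on each training fold only
    ("clf", LogisticRegression(solver="liblinear")),
])
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean().round(3))
```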
#Create Logistic Regression Model
LR = LogisticRegression(solver="liblinear")
LR.fit(X_train, y_train)
#predict target class on test data
y_pred = LR.predict(X_test)
accuracy_LR = accuracy_score(y_test, y_pred)
print('Training Score: ', LR.score(X_train, y_train).round(3))
print('Test Score: ', LR.score(X_test, y_test).round(3))
print('Classification Report of LR :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_LR.round(3))
#Print Confusion Matrix
cm_LR = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_LR = pd.DataFrame(cm_LR, index = label, columns = label)
sns.heatmap(cm1_LR, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
#Store the accuracy results for each model in a dataframe for final comparison
resultDf = pd.DataFrame({'Method':['Logistic Regression'], 'accuracy': [accuracy_LR] })
resultDf = resultDf[['Method', 'accuracy']]
resultDf
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
NB.fit(X_train, y_train)
#predict target class on test data
y_pred = NB.predict(X_test)
accuracy_NB = accuracy_score(y_test, y_pred)
print('Training Score: ', NB.score(X_train, y_train).round(3))
print('Test Score: ', NB.score(X_test, y_test).round(3))
print('Classification Report of Gaussian Naive Bayes model :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_NB.round(3))
#Print Confusion Matrix
cm_NB = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_NB = pd.DataFrame(cm_NB, index = label, columns = label)
sns.heatmap(cm1_NB, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['Naive Bayes'], 'accuracy': [accuracy_NB]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
#Build SVM Model
from sklearn.svm import SVC
SVM = SVC(gamma=0.025, C=3)
SVM.fit(X_train , y_train)
#predict target class on test data
y_pred = SVM.predict(X_test)
accuracy_SVM = accuracy_score(y_test, y_pred)
print('Training Score: ', SVM.score(X_train, y_train).round(3))
print('Test Score: ', SVM.score(X_test, y_test).round(3))
print('Classification Report of SVM model :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_SVM.round(3))
cm_SVM = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_SVM = pd.DataFrame(cm_SVM, index = label, columns = label)
sns.heatmap(cm1_SVM, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['SVM'], 'accuracy': [accuracy_SVM]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
from sklearn.neighbors import KNeighborsClassifier
#Build KNN model
KNN = KNeighborsClassifier()
KNN.fit(X_train, y_train)
#predict target class on test data
y_pred = KNN.predict(X_test)
accuracy_KNN = accuracy_score(y_test, y_pred)
print('Training Score: ', KNN.score(X_train, y_train).round(3))
print('Test Score: ', KNN.score(X_test, y_test).round(3))
print('Classification Report of KNN Classifier Model :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_KNN.round(3))
cm_KNN = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_KNN = pd.DataFrame(cm_KNN, index = label, columns = label)
sns.heatmap(cm1_KNN, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['KNN'], 'accuracy': [accuracy_KNN]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
#Build Decision Tree Classifier
#Prune the decision tree by limiting the max. depth of trees to avoid over-fitting
DT = DecisionTreeClassifier(criterion = "gini", random_state = random_state,max_depth=3, min_samples_leaf=5)
DT.fit(X_train, y_train)
#predict target class on test data
y_pred = DT.predict(X_test)
feature_cols = X.columns
accuracy_DT = accuracy_score(y_test, y_pred)
print('Training Score: ', DT.score(X_train, y_train).round(3))
print('Test Score: ', DT.score(X_test, y_test).round(3))
print('Classification Report of Decision Tree classifier :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_DT.round(3))
cm_DT = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_DT = pd.DataFrame(cm_DT, index = label, columns = label)
sns.heatmap(cm1_DT, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['Decision Tree'], 'accuracy': [accuracy_DT]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
print('Feature Importance for Decision Tree ', '--'*38)
feature_importances = pd.DataFrame(DT.feature_importances_, index = X.columns,
columns=['Importance']).sort_values('Importance', ascending = True)
feature_importances.sort_values(by = 'Importance', ascending = True).plot(kind = 'barh', figsize = (15, 7.2))
A single decision tree achieves a reasonable accuracy score, but it relies almost entirely on the PPE attribute to predict the target class and largely ignores the other attributes. Disadvantage: limited use of the available features.
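To see why the tree leans on one feature, the fitted tree can be drawn. A sketch using `sklearn.tree.plot_tree` on the iris data as a stand-in (with the notebook's model you would pass `DT` and `feature_names=feature_cols` instead):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X_demo, y_demo = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Each node shows the split feature and threshold; a dominant feature
# appears at the root and in most early splits
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(tree, filled=True, ax=ax)
fig.savefig("tree.png")
```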
#Stacking is designed to improve modeling performance
#Train a meta-classifier
level0 = list()
level0.append(('LR', LR))
level0.append(('KNN', KNN ))
level0.append(('CART', DT ))
level0.append(('SVM', SVM ))
level0.append(('Naive Bayes', NB ))
# define meta learner model
#Classification Meta-Model: Use Logistic Regression.
level1 = LR
# define the stacking ensemble
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
# fit the stacking ensemble on the training data only (fitting on all of X would leak information from the test set)
model.fit(X_train, y_train)
#predict Target class on test data in Meta Classifier Model
y_pred = model.predict(X_test)
accuracy_meta = accuracy_score(y_test, y_pred)
print('Training Score: ', model.score(X_train, y_train).round(3))
print('Test Score: ', model.score(X_test, y_test).round(3))
print('Classification Report of Meta Classifier model :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_meta.round(3))
cm_meta = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_meta = pd.DataFrame(cm_meta, index = label, columns = label)
sns.heatmap(cm1_meta, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['Meta Classifier'], 'accuracy': [accuracy_meta]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
#Apply the Random forest model and print the accuracy of Random forest Model
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators = 100 ,random_state = random_state)
RF.fit(X_train, y_train)
#predict target class on test data
y_pred = RF.predict(X_test)
accuracy_RF = accuracy_score(y_test, y_pred)
print('Training Score: ', RF.score(X_train, y_train).round(3))
print('Test Score: ', RF.score(X_test, y_test).round(3))
print('Classification Report of Random Forest model :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_RF.round(3))
cm_RF = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Affected", " Parkinsons Disease not Affected"]
cm1_RF = pd.DataFrame(cm_RF, index = label, columns = label)
sns.heatmap(cm1_RF, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['Random Forest'], 'accuracy': [accuracy_RF]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
print('Feature Importance for Random Forest Classifier ', '--'*38)
feature_importances = pd.DataFrame(RF.feature_importances_, index = X.columns,
columns=['Importance']).sort_values('Importance', ascending = True)
feature_importances.sort_values(by = 'Importance', ascending = True).plot(kind = 'barh', figsize = (15, 7.2))
Random forest, an ensemble of decision trees, achieves a strong accuracy score. Unlike the single tree, it spreads importance across all the independent attributes, although PPE still has the highest feature importance.
Advantage: broader use of the available features.
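Impurity-based feature importances (as plotted above) can be biased toward high-variance features; permutation importance, which measures how much the test score drops when one feature is shuffled, is a common cross-check. A sketch on synthetic data (with the notebook's model you would pass `RF, X_test, y_test` instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=195, n_features=10,
                                     n_informative=3, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Mean drop in test accuracy when each feature is shuffled, one column per feature
result = permutation_importance(rf, Xte, yte, n_repeats=10, random_state=42)
print(result.importances_mean.round(3))
```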
from sklearn.ensemble import BaggingClassifier
BAG = BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
BAG.fit(X_train, y_train)
#predict target class on test data
y_pred = BAG.predict(X_test)
accuracy_BAG = accuracy_score(y_test, y_pred)
print('Training Score: ', BAG.score(X_train, y_train).round(3))
print('Test Score: ', BAG.score(X_test, y_test).round(3))
print('Classification Report of Bagging Model:')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_BAG.round(3))
cm_BAG = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Not Affected", "Parkinsons Disease Affected"]
cm1_BAG = pd.DataFrame(cm_BAG, index = label, columns = label)
sns.heatmap(cm1_BAG, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': [accuracy_BAG]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
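The bagging model was fitted with `oob_score=True`, but the out-of-bag estimate is never printed. Each bootstrap sample leaves out roughly a third of the rows, and scoring every tree on the rows it never saw gives a free validation estimate without a separate hold-out set. A minimal sketch on synthetic data (the fitted `BAG` model above exposes the same attribute):

```python
# Sketch: reading the out-of-bag (OOB) accuracy that oob_score=True enables.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X_syn, y_syn = make_classification(n_samples=300, random_state=42)
bag = BaggingClassifier(n_estimators=50, max_samples=0.7,
                        bootstrap=True, oob_score=True, random_state=22)
bag.fit(X_syn, y_syn)

# Each estimator is evaluated only on the samples left out of its bootstrap
print('OOB accuracy estimate:', round(bag.oob_score_, 3))
```

Comparing `BAG.oob_score_` against the test-set accuracy is a quick sanity check that the model is not overfitting its training sample.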
# Apply Adaboost Ensemble Algorithm and print the accuracy.
from sklearn.ensemble import AdaBoostClassifier
ADABOOST = AdaBoostClassifier(n_estimators= 50, learning_rate=0.1, random_state=random_state)
ADABOOST.fit(X_train, y_train)
#predict target class on test data
y_pred = ADABOOST.predict(X_test)
accuracy_ADABOOST = accuracy_score(y_test, y_pred)
print('Training Score: ', ADABOOST.score(X_train, y_train).round(3))
print('Test Score: ', ADABOOST.score(X_test, y_test).round(3))
print('Classification Report of ADABOOST Ensemble Model:')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_ADABOOST.round(3))
cm_ADABOOST = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Not Affected", "Parkinsons Disease Affected"]
cm1_ADABOOST = pd.DataFrame(cm_ADABOOST, index = label, columns = label)
sns.heatmap(cm1_ADABOOST, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['ADABOOST'], 'accuracy': [accuracy_ADABOOST]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
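With a small learning rate of 0.1, `n_estimators=50` may or may not be enough boosting rounds. AdaBoost's `staged_predict` yields predictions after each round, which shows where test accuracy plateaus. A sketch on synthetic data (in the notebook, `ADABOOST` and the real `X_test`/`y_test` would be used):

```python
# Sketch: tracking test accuracy per boosting round with staged_predict.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_syn, y_syn = make_classification(n_samples=300, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)
ada = AdaBoostClassifier(n_estimators=50, learning_rate=0.1,
                         random_state=42).fit(X_tr, y_tr)

# One accuracy value per boosting round; boosting may stop early
staged = [accuracy_score(y_te, p) for p in ada.staged_predict(X_te)]
best_round = max(range(len(staged)), key=staged.__getitem__) + 1
print('best round:', best_round, '| accuracy at best round:', round(max(staged), 3))
```

If accuracy is still climbing at the final round, raising `n_estimators` (or the learning rate) is worth trying.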
#Apply GradientBoost Classifier Algorithm and print the accuracy
from sklearn.ensemble import GradientBoostingClassifier
GB = GradientBoostingClassifier(n_estimators = 20, random_state=random_state)
GB.fit(X_train, y_train)
#predict target class on test data
y_pred =GB.predict(X_test)
accuracy_GB = accuracy_score(y_test, y_pred)
print('Training Score: ', GB.score(X_train, y_train).round(3))
print('Test Score: ', GB.score(X_test, y_test).round(3))
print('Classification Report of Gradient Boosting Classifier Model :')
print(classification_report(y_test,y_pred))
print('Accuracy: ', accuracy_GB.round(3))
cm_GB = metrics.confusion_matrix(y_test, y_pred)
label = ["Parkinsons Disease Not Affected", "Parkinsons Disease Affected"]
cm1_GB = pd.DataFrame(cm_GB, index = label, columns = label)
sns.heatmap(cm1_GB, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
tempResultDf = pd.DataFrame({'Method':['Gradient Boosting'], 'accuracy': [accuracy_GB]})
resultDf = pd.concat([resultDf, tempResultDf])
resultDf = resultDf[['Method', 'accuracy']]
resultDf
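`n_estimators=20` for the gradient booster is an arbitrary choice; the `GridSearchCV` already imported at the top of the notebook can tune it together with the learning rate. A minimal sketch on synthetic data (the hypothetical grid values below are illustrative, not recommendations):

```python
# Sketch: tuning GradientBoostingClassifier with a small grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_syn, y_syn = make_classification(n_samples=300, random_state=42)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={'n_estimators': [20, 50, 100],   # illustrative values
                'learning_rate': [0.05, 0.1]},
    cv=StratifiedKFold(n_splits=3),
    scoring='accuracy')
grid.fit(X_syn, y_syn)
print(grid.best_params_, '| CV accuracy:', round(grid.best_score_, 3))
```

Because boosting rounds and learning rate trade off against each other, tuning them jointly is more informative than fixing either one in isolation.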
best_model = []
best_model.append(('Logisitic Regression', LR ))
best_model.append(('Naive Bayes', NB ))
best_model.append(('SVM', SVM ))
best_model.append(('Decision Tree', DT ))
best_model.append(('Meta-Classifier',model ))
best_model.append(('Random Forest', RF ))
best_model.append(('Bagging', BAG))
best_model.append(('AdaBoost', ADABOOST ))
best_model.append(('GradientBoost', GB ))
# Evaluate each model
output = []
identifier = []
Best_scoring = 'accuracy'
for name, clf in best_model:
    # k-fold cross-validation to compare all classification models on the same folds
    kfold = model_selection.KFold(n_splits=3, shuffle=True, random_state=random_state)
    cv_output = model_selection.cross_val_score(clf, X, y, cv=kfold, scoring=Best_scoring)
    output.append(cv_output)
    identifier.append(name)
    # report mean +/- std rather than the best fold, which overstates performance
    print("%s: %f (+/- %f)" % (name, cv_output.mean(), cv_output.std()))
# Using Box plot to find the Best Model
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111)
plt.title("Model Comparison of Standard Classification and Ensemble Models")
plt.boxplot(output);
ax.set_xticklabels(identifier)
plt.show()
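The comparison loop above uses plain `KFold`, but with roughly three times as many affected as unaffected patients in this dataset, `StratifiedKFold` (already imported at the top) keeps the class ratio consistent in every fold. A sketch of the same evaluation pattern on synthetic imbalanced data:

```python
# Sketch: stratified cross-validation, preserving the class ratio per fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with a ~25/75 class split, mimicking the imbalance
X_syn, y_syn = make_classification(n_samples=300, weights=[0.25, 0.75],
                                   random_state=42)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_syn, y_syn,
                         cv=skf, scoring='accuracy')
print('accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```

Swapping `kfold` for a `StratifiedKFold` in the comparison loop avoids folds where the minority (healthy) class is badly under-represented.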
The ultimate goal is to classify patients as Parkinson's Disease affected or not affected using the attributes extracted from their voice recordings.
Standard classification and ensemble machine-learning models were applied to predict a Parkinson's Disease diagnosis accurately; this would be an effective screening step prior to an appointment with a clinician.
From the model comparison above, the Random Forest model performs best, since it achieves the highest cross-validated accuracy. Random Forests are robust to outliers and noise, maintain accuracy across a large proportion of the data, and are a popular method for feature ranking, as they identify the features that best predict the target class.